[serve] Add replica queue length caching to replica scheduler #42943
Conversation
Benchmark results (baseline vs. with caching): HTTP no-op latency and HTTP throughput [result tables omitted].
@ray-project/ray-serve the changes in this PR are ready for an initial review, but please note the TODOs in the description (most notably, I still have a lot of tests to write).
The approach LGTM, thanks for making this more efficient, Ed!
Nice work so far. I haven't looked at the tests, but I left some comments on the implementation.
chosen_replica_id = t.replica_id
queue_len = t.result()
result.append((t.replica, queue_len))
self._replica_queue_len_cache.update(r.replica_id, queue_len)
[Nit] We calculate the timestamp for all the responses upon update rather than when we actually received each response. This means that if there's one really slow replica, all the replicas' timestamps could be off by up to `queue_len_response_deadline_s`.
Since `queue_len_response_deadline_s` is pretty low, this shouldn't be a major concern, but if it becomes larger, the timestamps may be pretty inaccurate.
Yep, you're right -- I have a follow-up item to generate all of these timestamps on the replica where possible.
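For context, here is a minimal sketch of a timestamped queue-length cache along the lines this thread discusses. The class name and the `update()` call match the snippet above, but the fields, the staleness handling, and the optional `timestamp` parameter are assumptions for illustration, not the PR's actual code:

```python
import time
from typing import Dict, Optional, Tuple


class ReplicaQueueLengthCache:
    """Sketch: maps replica ID -> (queue length, timestamp of last update)."""

    def __init__(self, staleness_timeout_s: float = 10.0):
        self._staleness_timeout_s = staleness_timeout_s
        self._entries: Dict[str, Tuple[int, float]] = {}

    def update(self, replica_id: str, queue_len: int,
               timestamp: Optional[float] = None):
        # The nit above: if `timestamp` defaults to "now" at update time, a
        # batch of responses processed after waiting on one slow replica all
        # get stamped up to `queue_len_response_deadline_s` later than when
        # each response actually arrived. Passing the receive-time (or,
        # better, a replica-generated) timestamp explicitly avoids the skew.
        ts = timestamp if timestamp is not None else time.time()
        self._entries[replica_id] = (queue_len, ts)

    def get(self, replica_id: str) -> Optional[int]:
        entry = self._entries.get(replica_id)
        if entry is None:
            return None
        queue_len, ts = entry
        if time.time() - ts > self._staleness_timeout_s:
            del self._entries[replica_id]  # A stale entry counts as a miss.
            return None
        return queue_len
```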
@GeneDer @shrekris-anyscale addressed comments and finished all of the TODOs from the description, PTAL.
Some minor non-blocking comments, LGTM!
    self._pending_requests_to_schedule.append(pending_request)
else:
    index = 0
    for pr in self._pending_requests_to_fulfill:
Non-blocking nitpick: we can probably just use `enumerate` so we don't need to track and increment the index separately.
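For illustration only, the difference the nitpick describes (using a stand-in list, since the surrounding scheduler code isn't shown here):

```python
pending = ["req_a", "req_b", "req_c"]  # stand-in for self._pending_requests_to_fulfill

# Before: the index is tracked and incremented by hand.
index = 0
for pr in pending:
    print(index, pr)
    index += 1

# After: enumerate yields the index alongside each item.
for index, pr in enumerate(pending):
    print(index, pr)
```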
Also, want to mention: now that we can assume the queue is sorted, we can utilize Python's `bisect.insort` 😄
Oh nice, I hadn't heard of `bisect.insort`. Looks interesting. I think in this case it might not be ideal, because in the common case we should be inserting at or near the front of the queue when going through this slower `is_retry` path.
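For reference, a small sketch of the trade-off being discussed (the tuple-based queue is illustrative, not the PR's actual data structure):

```python
import bisect

# Pending requests kept sorted by creation timestamp (illustrative values).
queue = [(1.0, "req_a"), (2.0, "req_b"), (3.0, "req_c")]

# bisect.insort finds the slot via binary search (O(log n)), but the list
# insertion itself still shifts trailing elements (O(n)). On the is_retry
# path the new request usually belongs at or near the front, so a plain
# queue.insert(0, ...) or deque-style appendleft costs about the same while
# staying simpler -- matching the reasoning in the reply above.
bisect.insort(queue, (1.5, "req_retry"))
print(queue)  # [(1.0, 'req_a'), (1.5, 'req_retry'), (2.0, 'req_b'), (3.0, 'req_c')]
```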
I'm approving the code changes. I haven't had a chance to review the unit tests.
Why are these changes needed?

Adds caching logic to avoid actively probing replicas for every request. This is integrated into the existing `PowerOfTwoChoicesReplicaScheduler` so it can reuse much of the same policy and mechanism (e.g., locality-aware and model multiplexing-aware candidate selection).

The benefits of this change are:
- Enables strict enforcement of `max_concurrent_queries`.
- Reduces proxy-side overhead for scheduling requests.
- Reduces latency for scheduling requests (in the "happy path," there's no extra RTT).

The changes are as follows:
- All calls to replicas are now streaming calls, and the first message returned is a system message. The replica uses this message to return its current queue length and reject requests if it's at capacity (`max_concurrent_queries`). If the replica rejects, the request scheduling procedure will be retried.
- The replica scheduler maintains a local cache of replica queue lengths. Entries in this cache have a timeout (currently set to 10 seconds). The cache is updated by (1) actively probing replicas and (2) the system response messages mentioned above.
- When scheduling a request, we first attempt to choose the best replica based on the queue lengths in the cache. If none of the candidates have entries in the cache that are below `max_concurrent_queries`, we fall back to active probing (as before this PR; see the sketch after the flag list below).

There are two feature flags introduced to control this behavior (both currently off by default):
- `RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE`
- `RAY_SERVE_ENABLE_STRICT_MAX_CONCURRENT_QUERIES` (implicitly set by the above)
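As referenced in the list above, here is a minimal sketch of the cache-first scheduling flow. Only the behavior follows the description; the method and attribute names (`choose_replica`, `_queue_len_cache`, `_probe_queue_lens`, `ReplicaWrapper`) are illustrative assumptions, not the PR's actual API:

```python
from typing import List, Optional, Tuple


async def choose_replica(self, candidates: List["ReplicaWrapper"]) -> "ReplicaWrapper":
    """Pick a replica using cached queue lengths, falling back to probing."""
    # 1. Cache path: consider candidates whose cached queue length is fresh
    #    (non-None) and strictly below max_concurrent_queries.
    best: Optional[Tuple[int, "ReplicaWrapper"]] = None
    for r in candidates:
        queue_len = self._queue_len_cache.get(r.replica_id)  # None if missing/stale
        if queue_len is not None and queue_len < self._max_concurrent_queries:
            if best is None or queue_len < best[0]:
                best = (queue_len, r)
    if best is not None:
        return best[1]

    # 2. Fallback path: no usable cache entries, so actively probe the
    #    candidates (the pre-PR behavior) and refresh the cache with the
    #    responses.
    probed = await self._probe_queue_lens(candidates)  # [(replica, queue_len)]
    for r, queue_len in probed:
        self._queue_len_cache.update(r.replica_id, queue_len)
    return min(probed, key=lambda pair: pair[1])[0]
```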
TODOs before merging:
- `ReplicaQueueLengthCache`.
- `emplace_front` logic (avoid tail latencies).
- `with_rejection` logic.

TODOs for subsequent PRs (should file follow-up issues):
Related issue number
Closes #42946
Closes #42947
Checks
- I've signed off every commit (using `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.